Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 73
Filtrar
1.
bioRxiv ; 2024 Mar 20.
Artigo em Inglês | MEDLINE | ID: mdl-38562829

RESUMO

The secreted mucins MUC5AC and MUC5B play critical defensive roles in airway pathogen entrapment and mucociliary clearance by encoding large glycoproteins with variable number tandem repeats (VNTRs). These polymorphic and degenerate protein coding VNTRs make the loci difficult to investigate with short reads. We characterize the structural diversity of MUC5AC and MUC5B by long-read sequencing and assembly of 206 human and 20 nonhuman primate (NHP) haplotypes. We find that human MUC5B is largely invariant (5761-5762aa); however, seven haplotypes have expanded VNTRs (6291-7019aa). In contrast, 30 allelic variants of MUC5AC encode 16 distinct proteins (5249-6325aa) with cysteine-rich domain and VNTR copy number variation. We grouped MUC5AC alleles into three phylogenetic clades: H1 (46%, ~5654aa), H2 (33%, ~5742aa), and H3 (7%, ~6325aa). The two most common human MUC5AC variants are smaller than NHP gene models, suggesting a reduction in protein length during recent human evolution. Linkage disequilibrium (LD) and Tajima's D analyses reveal that East Asians carry exceptionally large MUC5AC LD blocks with an excess of rare variation (p<0.05). To validate this result, we used Locityper for genotyping MUC5AC haplogroups in 2,600 unrelated samples from the 1000 Genomes Project. We observed signatures of positive selection in H1 and H2 among East Asians and a depletion of the likely ancestral haplogroup (H3). In Africans and Europeans, H3 alleles show an excess of common variation and deviate from Hardy-Weinberg equilibrium, consistent with heterozygote advantage and balancing selection. This study provides a generalizable strategy to characterize complex protein coding VNTRs for improved disease associations.

2.
bioRxiv ; 2024 Apr 20.
Artigo em Inglês | MEDLINE | ID: mdl-38659906

RESUMO

Structural variants (SVs) contribute significantly to human genetic diversity and disease 1-4 . Previously, SVs have remained incompletely resolved by population genomics, with short-read sequencing facing limitations in capturing the whole spectrum of SVs at nucleotide resolution 5-7 . Here we leveraged nanopore sequencing 8 to construct an intermediate coverage resource of 1,019 long-read genomes sampled within 26 human populations from the 1000 Genomes Project. By integrating linear and graph-based approaches for SV analysis via pangenome graph-augmentation, we uncover 167,291 sequence-resolved SVs in these samples, considerably advancing SV characterization compared to population-wide short-read sequencing studies 3,4 . Our analysis details diverse SV classes-deletions, duplications, insertions, and inversions-at population-scale. LINE-1 and SVA retrotransposition activities frequently mediate transductions 9,10 of unique sequences, with both mobile element classes transducing sequences at either the 3'- or 5'-end, depending on the source element locus. Furthermore, analyses of SV breakpoint junctions suggest a continuum of homology-mediated rearrangement processes are integral to SV formation, and highlight evidence for SV recurrence involving repeat sequences. Our open-access dataset underscores the transformative impact of long-read sequencing in advancing the characterisation of polymorphic genomic architectures, and provides a resource for guiding variant prioritisation in future long-read sequencing-based disease studies.

3.
bioRxiv ; 2024 Feb 16.
Artigo em Inglês | MEDLINE | ID: mdl-38529499

RESUMO

Haplotype information is crucial for biomedical and population genetics research. However, current strategies to produce de-novo haplotype-resolved assemblies often require either difficult-to-acquire parental data or an intermediate haplotype-collapsed assembly. Here, we present Graphasing, a workflow which synthesizes the global phase signal of Strand-seq with assembly graph topology to produce chromosome-scale de-novo haplotypes for diploid genomes. Graphasing readily integrates with any assembly workflow that both outputs an assembly graph and has a haplotype assembly mode. Graphasing performs comparably to trio-phasing in contiguity, phasing accuracy, and assembly quality, outperforms Hi-C in phasing accuracy, and generates human assemblies with over 18 chromosome-spanning haplotypes.

4.
Genome Biol ; 25(1): 26, 2024 Jan 19.
Artigo em Inglês | MEDLINE | ID: mdl-38243222

RESUMO

Potato is one of the world's major staple crops, and like many important crop plants, it has a polyploid genome. Polyploid haplotype assembly poses a major computational challenge. We introduce a novel strategy for the assembly of polyploid genomes and present an assembly of the autotetraploid potato cultivar Altus. Our method uses low-depth sequencing data from an offspring population to achieve chromosomal clustering and haplotype phasing on the assembly graph. Our approach generates high-quality assemblies of individual chromosomes with haplotype-specific sequence resolution of whole chromosome arms and can be applied in common breeding scenarios where collections of offspring are available.


Assuntos
Solanum tuberosum , Tetraploidia , Humanos , Haplótipos , Análise de Sequência de DNA , Solanum tuberosum/genética , Melhoramento Vegetal , Poliploidia
6.
Bioinform Adv ; 3(1): vbad167, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-38145107

RESUMO

Motivation: Compared to eukaryotes, prokaryote genomes are more diverse through different mechanisms, including a higher mutation rate and horizontal gene transfer. Therefore, using a linear representative reference can cause a reference bias. Graph-based pangenome methods have been developed to tackle this problem. However, comparisons in DNA space are still challenging due to this high diversity. In contrast, amino acid sequences have higher similarity due to evolutionary constraints, whereby a single amino acid may be encoded by several synonymous codons. Coding regions cover the majority of the genome in prokaryotes. Thus, panproteomes present an attractive alternative leveraging the higher sequence similarity while not losing much of the genome in non-coding regions. Results: We present PanPA, a method that takes a set of multiple sequence alignments of protein sequences, indexes them, and builds a graph for each multiple sequence alignment. In the querying step, it can align DNA or amino acid sequences back to these graphs. We first showcase that PanPA generates correct alignments on a panproteome from 1350 Escherichia coli. To demonstrate that panproteomes allow comparisons at longer phylogenetic distances, we compare DNA and protein alignments from 1073 Salmonella enterica assemblies against E.coli reference genome, pangenome, and panproteome using BWA, GraphAligner, and PanPA, respectively; with PanPA aligning around 22% more sequences. We also aligned a DNA short-reads whole genome sequencing (WGS) sample from S.enterica against the E.coli reference with BWA and the panproteome with PanPA, where PanPA was able to find alignment for 68% of the reads compared to 5% with BWA. Availalability and implementation: PanPA is available at https://github.com/fawaz-dabbaghieh/PanPA.

7.
Bioinform Adv ; 3(1): vbad149, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-37928341

RESUMO

Motivation: For genotype and haplotype inference, typically, sequencing reads aligned to a reference genome are used. The alignments identify the genomic origin of the reads and help to infer the absence or presence of sequence variants in the genome. Since long sequencing reads often come with high rates of systematic sequencing errors, single nucleotides in the reads are not always correctly aligned to the reference genome, which can thus lead to wrong conclusions about the allele carried by a sequencing read at the variant site. Thus, allele detection is not a trivial task, especially for single-nucleotide polymorphisms and indels. Results: To learn the characteristics of sequencing errors, we introduce a method to create an error model in non-variant regions of the genome. This information is later used to distinguish sequencing errors from alternative alleles in variant regions. We show that our method, k-merald, improves allele detection accuracy leading to better genotyping performance as compared to the existing WhatsHap implementation using edit-distance-based allele detection, with a decrease of 18% and 24% in error rate for high-coverage Oxford Nanopore and PacBio CLR sequencing reads for sample HG002, respectively. We additionally observed a prominent improvement in genotyping performance for sequencing data with low coverage. For 3× coverage Oxford Nanopore sequencing data, the genotyping error rate reduced from 34% to 31%, corresponding to a 9% decrease. Availability and implementation: https://github.com/whatshap/whatshap.

8.
Algorithms Mol Biol ; 18(1): 15, 2023 Sep 29.
Artigo em Inglês | MEDLINE | ID: mdl-37775806

RESUMO

Homologous recombination between the maternal and paternal copies of a chromosome is a key mechanism for human inheritance and shapes population genetic properties of our species. However, a similar mechanism can also act between different copies of the same sequence, then called non-allelic homologous recombination (NAHR). This process can result in genomic rearrangements-including deletion, duplication, and inversion-and is underlying many genomic disorders. Despite its importance for genome evolution and disease, there is a lack of computational models to study genomic loci prone to NAHR. In this work, we propose such a computational model, providing a unified framework for both (allelic) homologous recombination and NAHR. Our model represents a set of genomes as a graph, where haplotypes correspond to walks through this graph. We formulate two founder set problems under our recombination model, provide flow-based algorithms for their solution, describe exact methods to characterize the number of recombinations, and demonstrate scalability to problem instances arising in practice.

9.
Nature ; 621(7978): 355-364, 2023 Sep.
Artigo em Inglês | MEDLINE | ID: mdl-37612510

RESUMO

The prevalence of highly repetitive sequences within the human Y chromosome has prevented its complete assembly to date1 and led to its systematic omission from genomic analyses. Here we present de novo assemblies of 43 Y chromosomes spanning 182,900 years of human evolution and report considerable diversity in size and structure. Half of the male-specific euchromatic region is subject to large inversions with a greater than twofold higher recurrence rate compared with all other chromosomes2. Ampliconic sequences associated with these inversions show differing mutation rates that are sequence context dependent, and some ampliconic genes exhibit evidence for concerted evolution with the acquisition and purging of lineage-specific pseudogenes. The largest heterochromatic region in the human genome, Yq12, is composed of alternating repeat arrays that show extensive variation in the number, size and distribution, but retain a 1:1 copy-number ratio. Finally, our data suggest that the boundary between the recombining pseudoautosomal region 1 and the non-recombining portions of the X and Y chromosomes lies 500 kb away from the currently established1 boundary. The availability of fully sequence-resolved Y chromosomes from multiple individuals provides a unique opportunity for identifying new associations of traits with specific Y-chromosomal variants and garnering insights into the evolution and function of complex regions of the human genome.


Assuntos
Cromossomos Humanos Y , Evolução Molecular , Humanos , Masculino , Cromossomos Humanos Y/genética , Genoma Humano/genética , Genômica , Taxa de Mutação , Fenótipo , Eucromatina/genética , Pseudogenes , Variação Genética/genética , Cromossomos Humanos X/genética , Regiões Pseudoautossômicas/genética
10.
Hemasphere ; 7(8): e925, 2023 Aug.
Artigo em Inglês | MEDLINE | ID: mdl-37469802

RESUMO

The mutational landscape of B-cell precursor acute lymphoblastic leukemia (BCP-ALL), the most common pediatric cancer, is not fully described partially because commonly applied short-read next generation sequencing has a limited ability to identify structural variations. By combining comprehensive analysis of structural variants (SVs), single-nucleotide variants (SNVs), and small insertions-deletions, new subtype-defining and therapeutic targets may be detected. We analyzed the landscape of somatic alterations in 60 pediatric patients diagnosed with the most common BCP-ALL subtypes, ETV6::RUNX1+ and classical hyperdiploid (HD), using conventional cytogenetics, single nucleotide polymorphism (SNP) array, whole exome sequencing (WES), and the novel optical genome mapping (OGM) technique. Ninety-five percent of SVs detected by cytogenetics and SNP-array were verified by OGM. OGM detected an additional 677 SVs not identified using the conventional methods, including (subclonal) IKZF1 deletions. Based on OGM, ETV6::RUNX1+ BCP-ALL harbored 2.7 times more SVs than HD BCP-ALL, mainly focal deletions. Besides SVs in known leukemia development genes (ETV6, PAX5, BTG1, CDKN2A), we identified 19 novel recurrently altered regions (in n ≥ 3) including 9p21.3 (FOCAD/HACD4), 8p11.21 (IKBKB), 1p34.3 (ZMYM1), 4q24 (MANBA), 8p23.1 (MSRA), and 10p14 (SFMBT2), as well as ETV6::RUNX1+ subtype-specific SVs (12p13.1 (GPRC5A), 12q24.21 (MED13L), 18q11.2 (MIB1), 20q11.22 (NCOA6)). We detected 3 novel fusion genes (SFMBT2::DGKD, PDS5B::STAG2, and TDRD5::LPCAT2), for which the sequence and expression were validated by long-read and whole transcriptome sequencing, respectively. OGM and WES identified double hits of SVs and SNVs (ETV6, BTG1, STAG2, MANBA, TBL1XR1, NSD2) in the same patient demonstrating the power of the combined approach to define the landscape of genomic alterations in BCP-ALL.

11.
Genome Res ; 33(4): 496-510, 2023 04.
Artigo em Inglês | MEDLINE | ID: mdl-37164484

RESUMO

There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still generates more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6-7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.


Assuntos
DNA Satélite , Polimorfismo Genético , Humanos , DNA Satélite/genética , Haplótipos , Duplicações Segmentares Genômicas , Análise de Sequência de DNA
12.
Nat Biotechnol ; 2023 May 10.
Artigo em Inglês | MEDLINE | ID: mdl-37165083

RESUMO

Pangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph's ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. We also demonstrate construction of a Drosophila melanogaster pangenome.

13.
bioRxiv ; 2023 May 04.
Artigo em Inglês | MEDLINE | ID: mdl-37205567

RESUMO

Advances in long-read sequencing (LRS) technology continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant calling precision and recall of Oxford Nanopore Technologies (ONT) and PacBio HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant calling precision and recall of SVs and indels in HiFi datasets with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant callsets. While both technologies continue to evolve, our work offers guidance to design cost-effective experimental strategies that do not compromise on discovering novel biology.

14.
Genome Biol ; 24(1): 100, 2023 04 30.
Artigo em Inglês | MEDLINE | ID: mdl-37122002

RESUMO

The telomere-to-telomere (T2T) complete human reference has significantly improved our ability to characterize genome structural variation. To understand its impact on inversion polymorphisms, we remapped data from 41 genomes against the T2T reference genome and compared it to the GRCh38 reference. We find a ~ 21% increase in sensitivity improving mapping of 63 inversions on the T2T reference. We identify 26 misorientations within GRCh38 and show that the T2T reference is three times more likely to represent the correct orientation of the major human allele. Analysis of 10 additional samples reveals novel rare inversions at chromosomes 15q25.2, 16p11.2, 16q22.1-23.1, and 22q11.21.


Assuntos
Genoma Humano , Polimorfismo Genético , Humanos , Variação Estrutural do Genoma , Inversão Cromossômica
15.
Nat Biotechnol ; 41(6): 832-844, 2023 06.
Artigo em Inglês | MEDLINE | ID: mdl-36424487

RESUMO

Somatic structural variants (SVs) are widespread in cancer, but their impact on disease evolution is understudied due to a lack of methods to directly characterize their functional consequences. We present a computational method, scNOVA, which uses Strand-seq to perform haplotype-aware integration of SV discovery and molecular phenotyping in single cells by using nucleosome occupancy to infer gene expression as a readout. Application to leukemias and cell lines identifies local effects of copy-balanced rearrangements on gene deregulation, and consequences of SVs on aberrant signaling pathways in subclones. We discovered distinct SV subclones with dysregulated Wnt signaling in a chronic lymphocytic leukemia patient. We further uncovered the consequences of subclonal chromothripsis in T cell acute lymphoblastic leukemia, which revealed c-Myb activation, enrichment of a primitive cell state and informed successful targeting of the subclone in cell culture, using a Notch inhibitor. By directly linking SVs to their functional effects, scNOVA enables systematic single-cell multiomic studies of structural variation in heterogeneous cell populations.


Assuntos
Cromotripsia , Leucemia , Neoplasias , Humanos , Neoplasias/genética , Leucemia/genética , Rearranjo Gênico , Linhagem Celular , Variação Estrutural do Genoma
16.
Methods Mol Biol ; 2590: 127-138, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-36335496

RESUMO

WhatsHap is a command-line tool for phasing and phasing-related tasks. It allows to infer haplotypes in diploid and polyploid samples based on (preferably long) reads covering at least two heterozygous variants. It offers additional tools for working with phased variant calls such as computing statistics, comparing different phasings and assigning reads in alignment files to their haplotype.


Assuntos
Diploide , Poliploidia , Humanos , Análise de Sequência de DNA , Haplótipos/genética , Heterozigoto , Algoritmos , Polimorfismo de Nucleotídeo Único
17.
Genome Res ; 33(12): 2029-2040, 2023 Dec 27.
Artigo em Inglês | MEDLINE | ID: mdl-38190646

RESUMO

Advances in long-read sequencing (LRS) technologies continue to make whole-genome sequencing more complete, affordable, and accurate. LRS provides significant advantages over short-read sequencing approaches, including phased de novo genome assembly, access to previously excluded genomic regions, and discovery of more complex structural variants (SVs) associated with disease. Limitations remain with respect to cost, scalability, and platform-dependent read accuracy and the tradeoffs between sequence coverage and sensitivity of variant discovery are important experimental considerations for the application of LRS. We compare the genetic variant-calling precision and recall of Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) HiFi platforms over a range of sequence coverages. For read-based applications, LRS sensitivity begins to plateau around 12-fold coverage with a majority of variants called with reasonable accuracy (F1 score above 0.5), and both platforms perform well for SV detection. Genome assembly increases variant-calling precision and recall of SVs and indels in HiFi data sets with HiFi outperforming ONT in quality as measured by the F1 score of assembly-based variant call sets. While both technologies continue to evolve, our work offers guidance to design cost-effective experimental strategies that do not compromise on discovering novel biology.


Assuntos
Genômica , Nanoporos , Mutação INDEL , Sequenciamento Completo do Genoma
18.
Cell Genom ; 2(5)2022 May.
Artigo em Inglês | MEDLINE | ID: mdl-36452119

RESUMO

Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38 assembly while excluding regions problematic for benchmarking small variants, such as copy number variants, that should not have been in the previous version, which included 85% of GRCh38. It identifies eight times more false negatives in a short read variant call set relative to our previous benchmark. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.

19.
Nature ; 611(7936): 519-531, 2022 Nov.
Artigo em Inglês | MEDLINE | ID: mdl-36261518

RESUMO

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society1,2. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals3,4. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome5. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity6. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yield the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.


Assuntos
Mapeamento Cromossômico , Diploide , Genoma Humano , Genômica , Humanos , Mapeamento Cromossômico/normas , Genoma Humano/genética , Haplótipos/genética , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Sequenciamento de Nucleotídeos em Larga Escala/normas , Análise de Sequência de DNA/métodos , Análise de Sequência de DNA/normas , Padrões de Referência , Genômica/métodos , Genômica/normas , Cromossomos Humanos/genética , Variação Genética/genética
20.
Nat Methods ; 19(10): 1230-1233, 2022 10.
Artigo em Inglês | MEDLINE | ID: mdl-36109679

RESUMO

Complex structural variants (CSVs) encompass multiple breakpoints and are often missed or misinterpreted. We developed SVision, a deep-learning-based multi-object-recognition framework, to automatically detect and haracterize CSVs from long-read sequencing data. SVision outperforms current callers at identifying the internal structure of complex events and has revealed 80 high-quality CSVs with 25 distinct structures from an individual genome. SVision directly detects CSVs without matching known structures, allowing sensitive detection of both common and previously uncharacterized complex rearrangements.


Assuntos
Aprendizado Profundo , Genoma , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA
...